Abstract

In multiarm randomized trials, the outcome of interest is often “truncated by death,”
meaning that it is only observed or well defined conditional on an intermediate outcome. In this case,
in addition to pairwise contrasts, the joint inference for all treatment arms is also of interest. Under
a monotonicity assumption, we present methods for both pairwise and joint causal analyses of ordinal
treatments and binary outcomes in the presence of truncation by death. We illustrate via examples the
appropriateness of our assumptions in different scientific contexts.

Keywords: Bayesian analysis; Causal inference; Multiarm trials; Ordinal treatment variable; Principal 
stratification; Survey incentives 


1 Introduction 


In multiarm randomized trials, researchers are often interested in analyzing treatment effects on an outcome
that is measured or well defined only when an intermediate outcome takes certain values (Robins 1986;
Rubin 2000, 2006; Egleston et al. 2007; Chiba and VanderWeele 2011; Ding et al. 2011). For example,
consider a multiarm randomized HIV vaccine trial. Scientists might be interested in evaluating vaccine
effects on HIV viral load, as it correlates with infectiousness and disease progression (Hudgens et al. 2003;
Gilbert et al. 2003). However, HIV viral load is typically measured only for infected individuals. Two
problems occur in this case: in general, there are many potential comparisons that can be made between
different vaccination groups among infected subjects; moreover, these comparisons are subject to selection
bias as the vaccine may affect susceptibility to HIV infection. In the simple case of a two-arm trial, to deal
with the selection bias problem, several authors have proposed to consider the vaccine effects on viral load
among the always-infected stratum, the subpopulation who would become infected regardless of whether
they are vaccinated or not (e.g., Hudgens et al. 2003; Gilbert et al. 2003). However, there has not been
much work on analyzing this type of trial with more than two arms.

By convention, the intermediate outcome is called “survival,” and we say the final outcome is “truncated
by death” if it is only observed and/or well-defined for “survivors.” Thus in the HIV vaccine example above,
the always-infected stratum is referred to as the “always-survivor” stratum. The causal contrast among the
always-infected subjects is hence called the (always-)survivor average causal effect (SACE) (Rubin 2000;
Robins 1986, §12.2).

In general, even in a two-arm trial, the SACE is not identifiable without strong untestable assumptions.
As a result, there are no consistent tests for detecting non-null vaccine effects in the always-infected stratum.
Instead, under some reasonable assumptions, Hudgens et al. (2003) tested the null hypothesis presuming
the maximal degree of selection bias. Their approach is related to estimation of bounds on SACE, which
has been extensively studied in the literature. For example, Zhang and Rubin (2003) developed bounds on
SACE under various assumptions including the monotonicity assumption and the stochastic dominance
assumption. Imai (2008) provided an alternative proof that the bounds of Zhang and Rubin (2003) are
sharp by formulating the truncation-by-death problem as a “contaminated data” problem. These testing and
estimation methods are appealing in practice as they do not rely on strong identifiability assumptions.

However, so far as we are aware, there has not been much discussion on testing and estimation of
SACEs in a multiarm trial, which is fairly common in medical practice (Schulz and Grimes 2005). Prior to
our work, Lee et al. (2010) considered a sensitivity analysis approach to identify all SACEs in a three-arm
trial. Their identification results rely on a strong parametric assumption and several sensitivity parameters.
In this article, we instead propose a framework to systematically analyse SACEs in a general multiarm trial
without strong identification assumptions. To the best of our knowledge, our method is also the first that is
readily applicable to randomized trials with more than three treatment arms under truncation by death.

The testing and estimation of SACEs in a multiarm trial are more challenging compared to two-arm
trials. Firstly, in general there are many different SACEs that are well-defined. As we show later in Section
2.2, consideration of all SACEs (as in Lee et al. (2010)) can lead to paradoxical non-transitive conclusions.
Hence we instead restrict our attention to comparisons within the “finest” (principal) strata, thereby avoiding
this difficulty. Secondly, one needs to distinguish between an overall analysis of treatment effects and
a separate analysis of each individual contrast. In the simple setting without truncation by death, it is
widely known that compared to all pairwise comparisons with correction for multiple comparisons, an
overall analysis such as an ANOVA test often provides more power for testing the overall treatment effect
in a multiarm trial. When truncation by death is present, because of the non-identifiability of SACEs, this
advantage becomes more fundamental as non-identifiability remains even when the sample size goes to
infinity. In contrast to Lee et al. (2010), we distinguish between simultaneous versus marginal inference
for SACEs, and argue that they should be used to answer different questions. In particular, we show that
compared to marginal inference procedures, our proposed simultaneous inference procedures provide more
power for testing the overall treatment effect and the advantage remains even with an infinite sample size.
Thirdly, the simultaneous inference problem is unique to a multiarm trial. Again, since SACEs are not
identifiable, traditional statistical inference tools for multiarm trials without truncation by death are not
directly applicable to our setting. Instead, we develop novel simultaneous inference procedures to test an
overall treatment effect, and show that they have desirable asymptotic properties. We also generalize the
marginal inference procedures for a two-arm trial to get sharp bounds on SACEs for a general multiarm
trial. To focus on addressing these challenges, in this paper, we restrict our attention to trials with ordinal
treatment groups and binary outcomes.

The rest of this paper is organized as follows. In Section 2 we introduce our notation and assumptions,
and define our causal estimands. We also address the transitivity issue and identify three specific testing and
estimation questions that may arise in a general multiarm trial with truncation by death. We then propose
three novel procedures that answer these questions in Sections 3, 4 and 5. In Section 3 we discuss the unique
challenges for hypothesis testing with non-identifiable parameters, and develop a novel step-down testing
procedure to test the overall treatment effect in this situation. In Section 4 we develop a linear programming
algorithm to test an overall clinically relevant treatment effect. In Section 5 we derive the sharp marginal
bounds for each causal contrast of interest. In Section 6 we illustrate the proposed procedure with real data
analyses. Results from simulation studies can be found in the Supplementary Materials. We end with a
discussion in Section 7.
The programs that were used to analyse the data can be obtained from

http://wileyonlinelibrary.com/journal/rss-datasets 


2 Framework 


2.1 Data structure and assumptions 


Consider a multiarm trial with a control arm and multiple arms of active treatment. Let Z be an ordinal
treatment variable, where Z = 0 corresponds to the control treatment, and $Z \in \{1, \ldots, m\}$ corresponds to
different arms of active treatment. In what follows, we use the terminology “treatment arms” and “treatment
levels” interchangeably. We assume that each subject has m + 1 dichotomous potential outcomes Y(z), z =
0, ..., m, where Y(z) is defined as the outcome that would have been observed if the subject had been
assigned to treatment arm z. Similarly, we define S(z) as the potential survival status under treatment
assignment z. We assume Y(z) is well-defined only if S(z) = 1. In other words, the outcome of interest
is well-defined only for subjects who survive to the follow-up visit. We also assume that the observed data
$(Z_i, S_i, Y_i; i = 1, \ldots, N)$ are independently drawn from an infinite super-population.


Let $G = (S(0), \ldots, S(m))$ denote the basic principal stratum (Frangakis and Rubin 2002). If we let
the letter L denote S(z) = 1 (meaning “live”) and the letter D denote S(z) = 0 (meaning “die”), then G can
be rewritten as a string consisting of the letters “L” and “D.” For example, in a three-arm trial, $G_i = DLL$
indicates that subject i would die under control, but would survive under active treatment 1 or 2.

We make the following assumptions. 


Assumption 1: Stable unit treatment value assumption (SUTVA; Rubin 1980): there is no interference
between units, and there is only one version of treatment.


Under the SUTVA, the observed outcome equals the potential outcome under the observed treatment
arm, namely Y = Y(Z) and S = S(Z).

Assumption 2: Random treatment assignment: $Z \perp\!\!\!\perp (S(0), \ldots, S(m), Y(0), \ldots, Y(m))$.

Assumption 3: Monotonicity: $S_i(z_1) \ge S_i(z_2)$, $i = 1, \ldots, N$, for all $z_1 \ge z_2$.

The monotonicity assumption is sometimes plausible in social science studies if the treatment options
can be reasonably ordered. For example, in randomized experiments evaluating the effect of incentives on
survey response quality, it is intuitive that a higher level of incentives would not hurt survey response rates.
This assumption tends to be more controversial in medical studies where S represents survival, in which
there are often trade-offs between treatment benefits and side effects.

The only possible strata under the monotonicity assumption are strata of the form $D \cdots D L \cdots L$. To
compress notation, we denote all possible principal strata as $\{D^k L^{m+1-k}; k = 0, \ldots, m + 1\}$, where
members of principal stratum $D^k L^{m+1-k}$ would die if assigned to the first $k$ treatment arms but would
survive if assigned to the remaining $m + 1 - k$ treatment arms.
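
To make the notation concrete, the short Python sketch below enumerates the strata $D^k L^{m+1-k}$ for a given $m$ and, for each, the smallest arm under which its members survive. The function names `monotone_strata` and `minimal_surviving_arm` are illustrative choices of ours, not names used in the paper.

```python
def monotone_strata(m):
    """List the basic principal strata D^k L^(m+1-k), k = 0, ..., m+1.

    Each stratum is a string of potential survival statuses
    S(0), ..., S(m), with 'D' for death and 'L' for survival.
    """
    return ["D" * k + "L" * (m + 1 - k) for k in range(m + 2)]


def minimal_surviving_arm(stratum):
    """Smallest arm z with S(z) = 1, or None for the never-survivor stratum."""
    k = stratum.find("L")
    return k if k >= 0 else None


strata = monotone_strata(2)                          # three-arm trial: arms 0, 1, 2
print(strata)                                        # ['LLL', 'DLL', 'DDL', 'DDD']
print([minimal_surviving_arm(g) for g in strata])    # [0, 1, 2, None]
```

With $m + 1$ arms, monotonicity leaves only $m + 2$ of the $2^{m+1}$ conceivable survival patterns.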


2.2 Causal estimands and questions 


For randomized trials with two treatment arms, it is common to estimate the average causal effect in the
LL stratum (Kalbfleisch and Prentice 1980; Robins 1986; Rubin 2000), the only subgroup for which
both of the potential outcomes are well-defined: $\mathrm{SACE} = E[Y(1) - Y(0) \mid G = LL]$. In a general
multiarm trial, researchers may be interested in comparisons of potential outcomes within the same basic
principal stratum. For example, in the case where we have three levels of treatment: 0, 1, 2, the target
estimands are $E[Y(2) - Y(1) \mid G = LLL]$, $E[Y(1) - Y(0) \mid G = LLL]$, $E[Y(2) - Y(0) \mid G = LLL]$
and $E[Y(2) - Y(1) \mid G = DLL]$. These contrasts are causally meaningful as the memberships of basic
principal strata are defined at baseline.

To define the causal estimands for a general multiarm trial, we first introduce some notation. Let $\mu_g^z =
E[Y(z) \mid G = g]$ denote the mean potential outcome under treatment assignment $z$ in basic principal
stratum $g$. Also, let $\mathcal{M}(g)$ denote the minimal treatment level under which members of principal stratum
$g$ can survive. In other words, for members of principal stratum $g$, $S(z) = 1$ if and only if $z \ge \mathcal{M}(g)$.
Consequently, $\mu_g^z$ is well-defined if and only if $z \ge \mathcal{M}(g)$. Under the monotonicity assumption, all basic
principal strata take the form $g = D^k L^{m+1-k}$. By definition, $\mathcal{M}(D^k L^{m+1-k}) = k$. Also let $\Omega_k = \{g :
\mathcal{M}(g) \le k\}$ denote the collection of basic principal strata whose members would survive if assigned to
treatment arm $k$. The pairwise causal estimands in a multiarm trial then take the form

$$\Delta(z_1, z_2; g) \equiv \mu_g^{z_1} - \mu_g^{z_2}, \quad \text{where } z_1 > z_2 \ge \mathcal{M}(g). \tag{1}$$

For notational simplicity, in this article, when we write the notation $\mu_g^z$ and $\Delta(z_1, z_2; g)$, we always assume
that it is well-defined. We also note that the parameters involved in defining the causal contrasts $\Delta(z_1, z_2; g)$
are contained in the parameter vector $\boldsymbol{\mu}_{m-1} \equiv (\mu_g^z;\ g \in \Omega_{m-1},\ z \ge \mathcal{M}(g))$.

Other meaningful causal contrasts are made within coarsened principal strata, defined as groups that
combine several basic principal strata (Cheng and Small 2006). For example, in the case of a three-arm
trial, the contrast $E[Y(2) - Y(1) \mid G \in \{LLL, DLL\}]$ is also causally meaningful as memberships of
the coarsened principal strata $\{LLL, DLL\}$ are also defined at baseline. Some previous researchers hence
consider coarsened principal strata causal effects together with basic principal strata causal effects (e.g., Lee
et al. 2010). However, as Robins (1986) noted, if one were to compare $E[Y(2) - Y(0) \mid G = LLL]$,
$E[Y(1) - Y(0) \mid G = LLL]$ and $E[Y(2) - Y(1) \mid G \in \{LLL, DLL\}]$ simultaneously, it is possible that
the last two comparisons are both positive while the first one is negative. This lack of transitivity limits the
interpretability of causal effects defined within coarsened principal strata. In contrast, transitivity holds if
limited to basic principal strata (e.g., LLL). Hence in this article, we are primarily interested in comparisons
between potential outcomes in the same basic principal stratum.
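
The lack of transitivity can be seen in a small numerical sketch. All stratum means and stratum weights below are hypothetical values chosen by us for illustration; they do not come from the paper or from any data set.

```python
# Hypothetical stratum-specific means mu[g][z] = E[Y(z) | G = g].
mu = {
    "LLL": {0: 0.5, 1: 0.6, 2: 0.4},
    "DLL": {1: 0.1, 2: 0.9},
}
# Hypothetical relative sizes of LLL and DLL inside {LLL, DLL}.
weight = {"LLL": 0.5, "DLL": 0.5}


def basic_contrast(g, z1, z2):
    """E[Y(z1) - Y(z2) | G = g] for a basic principal stratum g."""
    return mu[g][z1] - mu[g][z2]


def coarsened_contrast(strata, z1, z2):
    """E[Y(z1) - Y(z2) | G in strata], a weighted average over strata."""
    total = sum(weight[g] for g in strata)
    return sum(weight[g] * basic_contrast(g, z1, z2) for g in strata) / total


c20 = basic_contrast("LLL", 2, 0)                  # arm 2 vs 0 within LLL
c10 = basic_contrast("LLL", 1, 0)                  # arm 1 vs 0 within LLL
c21 = coarsened_contrast(["LLL", "DLL"], 2, 1)     # arm 2 vs 1 within {LLL, DLL}
print(c20 < 0, c10 > 0, c21 > 0)                   # True True True
```

Here arm 1 beats arm 0 and arm 2 beats arm 1, yet arm 2 is worse than arm 0, because the second comparison is averaged over a different subpopulation. Within LLL alone the three contrasts are additive, so no such reversal can occur.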

On the other hand, as Robins et al. (2007) noted, the size of each basic principal stratum is likely to be
very small and consequently, each comparison in (1) only applies to a small portion of the population. Hence
for randomized trials with more than three treatment arms, we may have limited power to test treatment
effects for each basic principal stratum. What is more, we run into the problem of multiple comparisons as
there are multiple treatment arms and multiple basic principal strata.

Therefore, we first consider testing the global null hypothesis that the treatment is not effective in any of
the basic principal strata (for which some treatment comparison is well-defined). This question is scientifically
relevant. For example, in an HIV vaccine trial, testing the global null addresses whether there exists a
mechanism through which the vaccine alters viral load in infected individuals (Shepherd et al. 2006). Secondly,
clinicians may also be interested in whether the overall treatment effect is clinically meaningful so
that the active treatment is promising in clinical practice. For this purpose, an overall treatment effect may be
declared only if it is greater than the clinical margin of relevance specified by clinicians. Finally, besides an
overall treatment effect, scientists and clinicians may also be interested in isolating the non-zero/non-trivial
causal contrasts. In summary, the following questions are of interest with a multiarm trial:


1. Is there evidence of the existence of non-zero average treatment effects for at least one basic principal 
stratum between at least two treatment arms? 


2. Are there clinically relevant average treatment effects for at least one basic principal stratum between
at least two treatment arms?


3. Can we find the specific principal strata and treatment arms that correspond to the overall
non-zero/clinically relevant treatment effect, if such exists?

We address these questions in Sections 3, 4 and 5, respectively. Existing causal analysis literature in multiarm
trials with non-identifiable causal estimands (Cheng and Small 2006; Long et al. 2010; Lee et al. 2010)
focuses on answering the third question. However, as we explain later in Remark 3, one may be able to
answer the first two questions even if there is not enough information to answer the third. Hence it is
important to consider all three questions.


3 Testing treatment effects in a multiarm trial 


To find out if there is an overall non-zero treatment effect, it is desirable to consider the following testing
problem:

$$\mathcal{H}_0 : \Delta(z_1, z_2; g) = 0,\ \forall z_1, z_2, g \quad \text{vs} \quad \mathcal{H}_a : \exists z_1, z_2, g \ \text{s.t.}\ \Delta(z_1, z_2; g) \ne 0, \tag{2}$$

where $\forall$ means “for all,” $\exists$ means “there exists” and s.t. means “such that.” The testing problem (2) is
fundamentally different from (and more difficult than) a standard testing problem, in which one assumes that
if the observed data distribution was known, one would also know whether or not the hypothesis is true
(Lehmann and Romano 2006). The main difficulty here is that $\mathcal{H}_0$ is a statement about the non-identifiable
parameter vector $\boldsymbol{\mu}_{m-1}$. In other words, even if the population probabilities $P(S = 1 \mid Z = z)$ and
$P(Y = 1 \mid S = 1, Z = z)$ were known, we could only ascertain that $\boldsymbol{\mu}_{m-1}$ resides in a region, and
therefore may not know whether $\mathcal{H}_0$ is true or not.

Nevertheless, $\boldsymbol{\mu}_{m-1}$ is “partially identifiable” in the sense that the observed data distribution can narrow
down the range in which $\boldsymbol{\mu}_{m-1}$ can possibly lie (Cheng and Small 2006). For example, in a three-arm trial,
the domain of $\boldsymbol{\mu}_2$ is $[0,1]^6$. However, if the observed data distribution was known, the feasible region of $\boldsymbol{\mu}_2$
would be a subspace in $[0,1]^6$ subject to the following constraints:

$$
\begin{aligned}
P(Y = 1 \mid Z = 0, S = 1) &= \mu^0_{LLL}, \\
P(Y = 1 \mid Z = 1, S = 1) &= p^1_{LLL}\,\mu^1_{LLL} + p^1_{DLL}\,\mu^1_{DLL}, \\
P(Y = 1 \mid Z = 2, S = 1) &= p^2_{LLL}\,\mu^2_{LLL} + p^2_{DLL}\,\mu^2_{DLL} + p^2_{DDL}\,\mu^2_{DDL},
\end{aligned}
\tag{3}
$$

where $p^z_g = P(G = g \mid Z = z, S = 1)$ is identifiable under the assumptions of Section 2.1 (see Lemma 1 in the
Supplementary Materials). Figure 1 provides a graphical representation of the functional relations described
in (3).



Figure 1: A graph representing the functional dependencies in the causal analysis of a three-arm randomized
trial with truncation by death. Rectangular nodes represent observed variables; oval nodes represent
unknown parameters, with different shadings corresponding to different principal strata. Under the
monotonicity assumption, $p^z_g$ can be identified from observed quantities $P(S = 1 \mid Z = z)$.
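
The identification of the stratum weights can be sketched in a few lines. The source defers the formal statement to Lemma 1 of the Supplementary Materials, which is not reproduced here; the sketch below relies on the standard argument that, under randomization and monotonicity, the proportion of stratum $D^k L^{m+1-k}$ equals $P(S = 1 \mid Z = k) - P(S = 1 \mid Z = k - 1)$. Function names and the example survival rates are our own assumptions.

```python
def stratum_proportions(surv):
    """Stratum proportions from arm-specific survival rates.

    surv[z] = P(S = 1 | Z = z) for z = 0..m, assumed nondecreasing in z.
    Returns pi with pi[k] = P(G = D^k L^(m+1-k)) for k = 0..m+1.
    """
    m = len(surv) - 1
    if any(surv[z] > surv[z + 1] for z in range(m)):
        raise ValueError("survival rates violate the monotone ordering")
    pi = [surv[0]] + [surv[k] - surv[k - 1] for k in range(1, m + 1)]
    pi.append(1.0 - surv[m])          # never-survivors D^(m+1)
    return pi


def survivor_stratum_weights(surv, z):
    """p[k] = P(G = D^k L^(m+1-k) | Z = z, S = 1) for k = 0..z."""
    pi = stratum_proportions(surv)
    return [pi[k] / surv[z] for k in range(z + 1)]


surv = [0.3, 0.6, 0.9]                      # hypothetical survival rates
print(stratum_proportions(surv))            # roughly [0.3, 0.3, 0.3, 0.1]
print(survivor_stratum_weights(surv, 2))    # roughly [1/3, 1/3, 1/3]
```

Survivors in arm $z$ can only belong to strata with $k \le z$, which is why the weights for arm $z$ have $z + 1$ entries and sum to one.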

For a general multiarm trial, if the parameter space defined by $\mathcal{H}_0$ has no intersection with the feasible
region of $\boldsymbol{\mu}_{m-1}$, we would know that $\mathcal{H}_0$ is not true. In general, we introduce the following notions for
hypothesis testing with non-identifiable parameters.

Definition 1: We define a hypothesis relating to a parameter to be compatible with an observed data
distribution if the parameter space defined by the hypothesis has a non-empty intersection with the feasible region
of the parameter under the observed data distribution.

In particular, if a parameter is completely unidentifiable such that the observed data distribution imposes 
no constraints on the parameter, then all hypotheses relating to that parameter are compatible with the 
observed data distribution. On the other hand, if a parameter is identifiable so that its feasible region under 
the observed data distribution is always a single point set, then all compatible hypotheses are true. 

In general, however, not all compatible hypotheses are true. Nevertheless, owing to lack of identifiability, 
a true hypothesis may not be distinguishable from data with an untrue yet compatible hypothesis. This leads 
to the following notion of sharpness. 











Definition 2: We define a test to be sharp for testing a null hypothesis if when the null is not compatible 
with the observed data distribution (and is hence untrue), the power of the test tends to 1 when the sample 
size goes to infinity. 


Intuitively, similar to consistent tests, sharp tests are those that maximize power asymptotically. The
difference is that as the sample size goes to infinity, with probability tending to 1, sharp tests reject any
hypotheses that are incompatible with the observed data distribution, whereas consistent tests reject any
hypotheses that are untrue. In small sample settings, however, the conclusions that one would draw from
a sharp test are similar to those from a consistent test. If a hypothesis is rejected, one would conclude that
it is untrue (at a certain significance level); if otherwise, no claims about the correctness of the hypothesis
would be made. We also note that for a standard hypothesis testing problem as described in Lehmann
and Romano (2006), sharp tests are the same as consistent tests. When the null hypothesis concerns
non-identifiable parameters, however, there are in general no consistent tests. Instead, sharpness plays the role
of consistency in a standard hypothesis testing problem.

The notion of sharp tests is similar in spirit to the notion of sharp bounds, defined as the tightest possible
bound given the observed data distribution (e.g., Imai 2008). This notion has also been used implicitly in
previous works. For example, the test of Hudgens et al. (2003) for SACE in a two-arm trial is sharp.

Below in Section 3.1 we develop a sharp test for problem (2) under the presumption that the observed
data distribution is known. In other words, we assume the sample size is infinite such that there is no
stochastic variation in the observed data. In Section 3.2 we incorporate sampling uncertainty into our proposed
test using a Bayesian method.


3.1 A step-down procedure for testing the global null $\mathcal{H}_0$

To fix ideas, we first consider the problem of a three-arm trial, for which $\mathcal{H}_0$ holds if and only if

$$\mu^0_{LLL} = \mu^1_{LLL} = \mu^2_{LLL} \tag{4}$$

and

$$\mu^1_{DLL} = \mu^2_{DLL}. \tag{5}$$

We hence propose a two-step procedure. Firstly we test hypothesis (4). If (4) is compatible with the observed
data distribution, we then test if (5) is compatible with the observed data distribution conditioning on (4).

Specifically, one can see from Figure 1 that $\mu^0_{LLL}$ is identifiable from the observed data. Suppose
the feasible regions of $\mu^1_{LLL}$ and $\mu^2_{LLL}$ are $B_{01}$ and $B_{02}$, respectively. If $\mu^0_{LLL}$ is not contained in the
intersection of $B_{01}$ and $B_{02}$, then (4) and hence $\mathcal{H}_0$ are not compatible with the observed data distribution.
If otherwise, so that (4) is compatible with the observed data distribution, we then test hypothesis (5) under
the assumption that hypothesis (4) holds. Note that, under hypothesis (4), $\mu^1_{LLL}$ and $\mu^2_{LLL}$ are identifiable.
Consequently, $\mu^1_{DLL}$ is identifiable. Suppose the feasible region of $\mu^2_{DLL}$ under the constraint (4) is $B_{12}$.
If $\mu^1_{DLL}$ is not contained in $B_{12}$, we conclude that (5) is not compatible with the observed data distribution
under the constraint (4) and hence reject $\mathcal{H}_0$. If otherwise, we conclude that $\mathcal{H}_0$ is compatible with the
observed data distribution.

Algorithm 1 generalizes the procedure described above to general multiarm trials. Theorem 1 states the
asymptotic optimality of Algorithm 1. The proof is provided in the Supplementary Materials.

Algorithm 1 A step-down algorithm for testing the global null hypothesis $\mathcal{H}_0$

1. Set $k = 0$.

2. For $z = k, \ldots, m$:
       obtain the feasible region (under the maintained assumptions) $B_{kz}$ for $\mu^z_{D^k L^{m+1-k}}$ using Theorem 2.

3. If $\bigcap_{z = k, \ldots, m} B_{kz} = \emptyset$:
       reject $\mathcal{H}_0$; report $k$; stop
   else
       set
       $$\mu^k_{D^k L^{m+1-k}} = \cdots = \mu^m_{D^k L^{m+1-k}}. \tag{6}$$

4. If $k = m$:
       fail to reject $\mathcal{H}_0$ and stop
   else
       set $k = k + 1$ and go to Step 2.


Theorem 1: The test given by Algorithm 1 is sharp for testing $\mathcal{H}_0$. In other words, it is asymptotically
optimal for testing $\mathcal{H}_0$ as it maximizes power given the observed data distribution.

To derive the feasible regions $(B_{kz};\ k = 0, \ldots, m,\ z = k, \ldots, m)$ in Algorithm 1, we introduce notation
building on Horowitz and Manski (1995). Let $Q_g^z(\cdot)$ denote the distribution (function) of outcome $Y$ among
members of subgroup $g$ who receive treatment $z$, and $\delta_x(\cdot)$ be a degenerate distribution function localized
at $x$. As $Y$ is binary, $Q_g^z(\cdot)$ is a Bernoulli distribution with mean $m_g(z)$: $Q_g^z(\cdot) = \{1 - m_g(z)\}\delta_0(\cdot) +
m_g(z)\delta_1(\cdot)$. To compress notation, we write $Q_g^z(\cdot)$ as $Q_g^z$. Also let $L_\lambda(Q)$ and $U_\lambda(Q)$ be functionals that
map a distribution function $Q$ to the corresponding distributions truncated at the lower $\lambda$ quantile and upper
$\lambda$ quantile, respectively. Theorem 2 gives the formula for the feasible region $B_{lz}$.

Theorem 2: Suppose that the observed data distribution is known and (6) holds for all $k < l$. Let
$g = D^l L^{m+1-l}$ and $\bar{g} = \bigcup_{k=l}^{z} D^k L^{m+1-k}$ be the coarsened principal stratum whose members would survive
if assigned to treatment $z$ but would die if assigned to treatment $l - 1$. The feasible region of $\mu_g^z$ is

$$B_{lz} = \left[ \int y \, dL_{\omega_g}\!\left(Q^z_{\bar{g}}\right),\ \int y \, dU_{\omega_g}\!\left(Q^z_{\bar{g}}\right) \right], \tag{7}$$

where $\omega_g = P[G = g \mid G \in \bar{g}] = p^z_g \big/ \sum_{g' \in \bar{g}} p^z_{g'}$, and $Q^z_{\bar{g}}$ is a Bernoulli distribution with
mean

$$m_{\bar{g}}(z) = \frac{m(z) - \sum_{g' \in \Omega_{l-1}} p^z_{g'}\, \mu^z_{g'}}{1 - \sum_{g' \in \Omega_{l-1}} p^z_{g'}},$$

in which $m(z) = P[Y = 1 \mid Z = z, S = 1]$.


Intuitively, the bounds of $B_{lz}$ are obtained by assigning the smallest/largest $\omega_g$ portion of observed
outcome values in distribution $Q^z_{\bar{g}}$ to principal stratum $g$. The proof is in the Supplementary Materials.

Remark 1: Algorithm 1 is a “step-down” procedure in the sense that the hypothesis $\mathcal{H}_0$ is decomposed into
a series of hypotheses where the first hypothesis concerns the first stratum $L^{m+1}$, the second hypothesis
concerns the second stratum $DL^m$ conditioning on the first hypothesis, and so on.
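
For binary outcomes the trimming bounds have a simple closed form: if a subgroup makes up a fraction $\omega$ of a Bernoulli population with mean $\bar{m}$, its own mean lies in $[\max\{0, (\bar{m} - (1 - \omega))/\omega\},\ \min\{1, \bar{m}/\omega\}]$. The Python sketch below uses this fact to mimic the large-sample step-down check when the observed data distribution is treated as known. It is a minimal illustration under our reading of Algorithm 1 and Theorem 2, not the authors' implementation; the function name, the tolerance `tol`, and the decision to skip empty strata are our assumptions.

```python
def step_down_test(surv, m_obs, tol=1e-9):
    """Large-sample step-down check of the global null for binary outcomes.

    surv[z]  = P(S = 1 | Z = z), nondecreasing in z (monotonicity).
    m_obs[z] = P(Y = 1 | Z = z, S = 1).
    Returns (rejected, k), where k is the first stratum index at which the
    null is incompatible with the observed distribution (None otherwise).
    """
    m = len(surv) - 1
    # pi[k] = P(G = D^k L^(m+1-k)) for k = 0..m
    pi = [surv[0]] + [surv[k] - surv[k - 1] for k in range(1, m + 1)]
    mu = {}  # common mean of stratum j fixed by the null in earlier steps
    for k in range(m + 1):
        if pi[k] <= tol:
            continue  # empty stratum: nothing to test
        lo, hi, point = 0.0, 1.0, None
        for z in range(k, m + 1):
            rest = sum(pi[j] for j in range(k, z + 1))   # mass of strata k..z
            known = sum(pi[j] * mu[j] for j in mu)       # strata already fixed
            resid_mean = (surv[z] * m_obs[z] - known) / rest
            omega = pi[k] / rest                         # share of stratum k
            lo = max(lo, (resid_mean - (1.0 - omega)) / omega)
            hi = min(hi, resid_mean / omega)
            if z == k:
                point = resid_mean   # omega = 1: the mean is point identified
        if lo > hi + tol:
            return True, k           # the feasible regions do not intersect
        mu[k] = point                # impose equality across arms and go on
    return False, None


# Hypothetical observed distribution for a three-arm trial.
print(step_down_test([0.3, 0.6, 0.9], [0.3, 0.0, 0.5]))   # (True, 0)
```

In the example call, the always-survivor mean under control is 0.3, while the arm-1 data force the same stratum's mean to be 0, so the null is already incompatible at the first step.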


3.2 Bayesian procedures 

We have so far developed a sharp test for problem (2). In practice, however, sampling uncertainty must be
taken into account when making statistical inference. Here we introduce a Bayesian procedure to estimate
the posterior probability that $\mathcal{H}_0$ is not compatible with the observed data distribution. The Bayesian method
produces multiple samples of the posterior distribution, thereby reflecting randomness in observed data.

Let $p(s, y \mid z) = P(S = s, Y = y \mid Z = z)$ and $p(\cdot, \cdot \mid z) = (p(1, 1 \mid z), p(1, 0 \mid z), p(0, \dagger \mid z))$, where
$\dagger$ indicates that $Y$ is undefined when $S = 0$. Define $p = (p(\cdot, \cdot \mid 0), \ldots, p(\cdot, \cdot \mid m))$. Under independent
Dirichlet priors over the observed distributions $p(\cdot, \cdot \mid z)$, $z = 0, \ldots, m$, it is easy to sample from the
posterior distribution via conjugacy. We propose to use Algorithm 2 to calculate the posterior probability
that $\mathcal{H}_0$ is not compatible with the observed data distribution.

Algorithm 2 A Bayesian procedure for testing $\mathcal{H}_0$


1. Place an independent Dirichlet prior $\mathrm{Dir}(\alpha_{3z+1}, \alpha_{3z+2}, \alpha_{3z+3})$ on $p(\cdot, \cdot \mid z)$, $z = 0, \ldots, m$.

2. Simulate samples from the posterior distributions, which are independent Dirichlet distributions

   $$\mathrm{Dir}(\alpha_{3z+1} + n_{3z+1},\ \alpha_{3z+2} + n_{3z+2},\ \alpha_{3z+3} + n_{3z+3}), \quad z = 0, \ldots, m,$$

   where $n_{3z+1} = \sum_{i=1}^{N} I(S_i = 1, Y_i = 1, Z_i = z)$, $n_{3z+2} = \sum_{i=1}^{N} I(S_i = 1, Y_i = 0, Z_i = z)$ and
   $n_{3z+3} = \sum_{i=1}^{N} I(S_i = 0, Z_i = z)$.

3. Run Algorithm 1 with each of the posterior samples satisfying the following inequalities:

   $$P(S = 1 \mid Z = m) \ge \cdots \ge P(S = 1 \mid Z = 1) \ge P(S = 1 \mid Z = 0). \tag{8}$$

   Note that (8) characterizes the set of observed data distributions arising from the potential outcome model
   defined by Assumptions 1-3.

4. Report the proportion of posterior samples with which $\mathcal{H}_0$ is rejected.
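
The sampling and filtering steps can be sketched with the Python standard library alone, using the standard fact that a Dirichlet draw is obtained by normalizing independent Gamma variates. The sketch below is an illustration under our own assumptions: it uses the same prior for every arm (the algorithm above allows arm-specific $\alpha$ values), the counts are hypothetical, and the argument `test` is a placeholder for a large-sample test such as Algorithm 1.

```python
import random


def posterior_draws(counts, prior=(1.0, 1.0, 1.0), n_draws=1000, seed=0):
    """Posterior draws of p(., . | z) that satisfy the ordering (8).

    counts[z] = (#{S=1, Y=1}, #{S=1, Y=0}, #{S=0}) observed in arm z.
    Each kept draw is a list over arms of (p(1,1|z), p(1,0|z), p(0,.|z)).
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        draw = []
        for n in counts:
            # Dirichlet(alpha + n) via normalized Gamma(alpha + n, 1) variates
            g = [rng.gammavariate(a + c, 1.0) for a, c in zip(prior, n)]
            total = sum(g)
            draw.append(tuple(x / total for x in g))
        surv = [p[0] + p[1] for p in draw]
        if all(surv[z] <= surv[z + 1] for z in range(len(surv) - 1)):
            kept.append(draw)        # keep only draws consistent with (8)
    return kept


def posterior_rejection_rate(draws, test):
    """Share of kept draws for which test(surv, m_obs) returns True."""
    if not draws:
        raise ValueError("no posterior draw satisfies the ordering")
    rejected = 0
    for draw in draws:
        surv = [p[0] + p[1] for p in draw]
        m_obs = [p[0] / (p[0] + p[1]) for p in draw]
        rejected += bool(test(surv, m_obs))
    return rejected / len(draws)


# Hypothetical counts for arms 0, 1, 2.
counts = [(30, 70, 100), (40, 80, 80), (60, 90, 50)]
draws = posterior_draws(counts, n_draws=500, seed=1)
# Placeholder decision rule standing in for Algorithm 1 (illustration only).
rate = posterior_rejection_rate(draws, lambda surv, m_obs: max(m_obs) - min(m_obs) > 0.2)
print(len(draws), rate)
```

Passing the test as a function keeps the sampler independent of how the large-sample test is implemented.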


Remark 2: The step-down procedure in Algorithm 1 has a similar structure to the sequential tests for nested
hypotheses discussed by Rosenbaum (2008). His procedure has attractive Frequentist properties since it
controls the type I error rate without resorting to multiplicity adjustment. However, with his methods one
proceeds to the next step if the current hypothesis is rejected whereas in our proposal, one proceeds if the
current hypothesis is not rejected. Moreover, in his context, the parameters of interest are identifiable. Hence
Rosenbaum’s results are not directly applicable to our case.


4 Testing clinically relevant treatment effects in a multiarm trial 

If a non-zero treatment effect is found using Algorithm 2, a natural question arises as to whether the
treatment effect is clinically meaningful. Suppose the margin of clinical relevance is $\Delta_0$ such that a treatment
effect smaller than this would not matter in practice, and also suppose that the treatment effect is clinically
meaningful only if a higher treatment level corresponds to a higher mean potential outcome. It is desirable
to consider the following testing problem:

$$\mathcal{H}_{0,c} : \Delta(z_1, z_2; g) \le \Delta_0,\ \forall g, z_1 > z_2 \quad \text{vs} \quad \mathcal{H}_{a,c} : \exists g, z_1 > z_2 \ \text{s.t.}\ \Delta(z_1, z_2; g) > \Delta_0, \tag{9}$$

where the letter “c” in $\mathcal{H}_{0,c}$ is short for “clinical relevance.” Similar to (2), (9) is a testing problem on
non-identifiable parameters. However, as the null parameter space is a non-degenerate region in the domain
of $\boldsymbol{\mu}_{m-1}$, the step-down procedure developed in Section 3 is not applicable. Instead, we define $\Delta_{\max}$ to be
the largest $\Delta(z_1, z_2; g)$ that appears in $\mathcal{H}_{0,c}$: $\Delta_{\max} = \max_{g,\, z_1 > z_2} \Delta(z_1, z_2; g)$. (9) can then be rewritten in an
equivalent form using $\Delta_{\max}$: $\mathcal{H}_{0,c} : \Delta_{\max} \le \Delta_0$ vs $\mathcal{H}_{a,c} : \Delta_{\max} > \Delta_0$. The following lemma says
the testing problem (9) can be translated into the identification problem on $\Delta_{\max}$.

Lemma 1: Suppose the sharp (large sample) lower bound for $\Delta_{\max}$ is $\Delta_{\max,\mathrm{slb}}$. A sharp test would reject
$\mathcal{H}_{0,c}$ if and only if $\Delta_{\max,\mathrm{slb}} > \Delta_0$.

As $\Delta_{\max}$ is a function of $\boldsymbol{\mu}_{m-1}$, in general, identifying $\Delta_{\max,\mathrm{slb}}$ involves minimizing $\Delta_{\max}$ subject to
the constraints on $\boldsymbol{\mu}_{m-1}$ imposed by the observed data distribution. Theorem 3 below says that the feasible
region of $\boldsymbol{\mu}_{m-1}$ is a convex polytope, defined as an intersection of finitely many half spaces. Consequently,
this optimization problem can be translated into a linear programming problem and efficiently solved with
off-the-shelf software. See Algorithm 1 in the Supplementary Materials for more details.

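Theorem 3: Given the observed data distribution, the feasible region of $\boldsymbol{\mu}_{m-1}$ is a subset of the unit
hypercube subject to the following constraints:

$$
\begin{aligned}
&\sum_{g \in \Omega_z} p^z_g\, \mu^z_g = m(z), \quad z = 0, \ldots, m - 1; \\
&\max\!\left(0,\ m(z) - p^z_{D^m L}\right) \le \sum_{g \in \Omega_{m-1}} p^z_g\, \mu^z_g \le \min\!\left(1 - p^z_{D^m L},\ m(z)\right), \quad z = m,
\end{aligned}
$$

where $p^z_g$ is identifiable from data under the assumptions of Section 2.1 (see Lemma 1 in the Supplementary
Materials). In particular, the feasible region of $\boldsymbol{\mu}_{m-1}$ is a convex polytope.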

To incorporate statistical uncertainty, one can use Bayesian analysis methods to derive a credible interval
for $\Delta_{\max,\mathrm{slb}}$. Specifically, one runs Steps 1-4 in Algorithm 2 to get multiple posterior samples that satisfy
the constraint (8), and then produces a percentile based credible interval for $\Delta_{\max,\mathrm{slb}}$ based on the posterior
samples. One may also estimate the posterior probability of rejecting $\mathcal{H}_{0,c}$ for any given positive value $\Delta_0$
with these posterior sample draws.


5 Marginal credible intervals for a given contrast 


If a clinically non-trivial treatment effect is found, then it is desirable to identify the principal strata and
treatment arms that correspond to this treatment effect. In this case, the marginal feasible regions and
associated credible intervals for $\Delta(z_1, z_2; g)$ are of interest.

If the observed data distribution was known, then the feasible region for $\Delta(z_1, z_2; g)$ can be obtained
from the feasible regions for $\mu_g^{z_1}$ and $\mu_g^{z_2}$. Specifically, we have the following theorem.


Theorem 4: Suppose the observed data distribution is known, and $B_{\mathcal{M}(g), z_1}$ and $B_{\mathcal{M}(g), z_2}$ are feasible
regions for $\mu_g^{z_1}$ and $\mu_g^{z_2}$, respectively. Then we have the following results.


1. For $z = z_1, z_2$,

   $$B_{\mathcal{M}(g), z} = \left[ \int y \, dL_{p^z_g}(Q^z),\ \int y \, dU_{p^z_g}(Q^z) \right].$$


2. The feasible region of $\Delta(z_1, z_2; g)$ is

   $$\left[ \int y \, dL_{p^{z_1}_g}(Q^{z_1}) - \int y \, dU_{p^{z_2}_g}(Q^{z_2}),\ \int y \, dU_{p^{z_1}_g}(Q^{z_1}) - \int y \, dL_{p^{z_2}_g}(Q^{z_2}) \right].$$


In practice, credible intervals for $\Delta(z_1, z_2; g)$ can be constructed from posterior sample draws.
These posterior draws may also be used to estimate the posterior probability of rejecting the null hypothesis
$\mathcal{H}_{0,m} : \Delta(z_1, z_2; g) \le \Delta_0$, where the letter “m” in $\mathcal{H}_{0,m}$ is short for “marginal.”
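
For binary outcomes the marginal bounds reduce to simple arithmetic. The sketch below assumes that $Q^z$ is the Bernoulli distribution of $Y$ among survivors in arm $z$, with mean $m(z) = P(Y = 1 \mid Z = z, S = 1)$, and that the stratum weights are obtained from the survival rates under monotonicity; both points reflect our reading of the theorem. The function names and the example inputs are illustrative.

```python
def trim_bounds(mean, frac):
    """Range of the mean of a subgroup that makes up a fraction `frac`
    of a Bernoulli population with the given mean."""
    lower = max(0.0, (mean - (1.0 - frac)) / frac)
    upper = min(1.0, mean / frac)
    return lower, upper


def contrast_bounds(surv, m_obs, k, z1, z2):
    """Marginal bounds on mu_g^{z1} - mu_g^{z2} for g = D^k L^(m+1-k).

    surv[z]  = P(S = 1 | Z = z), m_obs[z] = P(Y = 1 | Z = z, S = 1);
    requires z1 > z2 >= k and a non-empty stratum.
    """
    pi_k = surv[k] - (surv[k - 1] if k > 0 else 0.0)       # P(G = g)
    lo1, hi1 = trim_bounds(m_obs[z1], pi_k / surv[z1])     # bounds in arm z1
    lo2, hi2 = trim_bounds(m_obs[z2], pi_k / surv[z2])     # bounds in arm z2
    return lo1 - hi2, hi1 - lo2


surv, m_obs = [0.3, 0.6, 0.9], [0.3, 0.0, 0.5]    # hypothetical inputs
print(contrast_bounds(surv, m_obs, 0, 2, 1))      # (0.0, 1.0) for LLL
print(contrast_bounds(surv, m_obs, 1, 2, 1))      # (0.0, 1.0) for DLL
```

The lower end of the contrast pairs the smallest feasible mean in arm $z_1$ with the largest feasible mean in arm $z_2$, and the upper end does the reverse.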


Remark 3: We remark that even if the observed data provide evidence for the existence of non-zero/non-trivial treatment effects, it is possible that they do not contain information on the specific principal strata and treatment arms that correspond to these treatment effects. Moreover, unlike the case for multiarm trials without truncation by death, this can happen even with an infinite sample size.

We illustrate our point with the following numerical example. Consider a three-arm trial such that $\pi_{LLL} = \pi_{DLL} = \pi_{DDL} = 0.3$, $\pi_{DDD} = 0.1$, $m(0) = 0.3$, $m(1) = 0$, $m(2) = 0.5$, where $\pi_g = P(G = g)$. In this case, $\mu^0_{LLL} = 0.3$ and $\mu^1_{LLL} = \mu^1_{DLL} = 0$. It follows that $\Delta_{\max} = \max(0, \mu^2_{LLL} - \mu^1_{LLL}, \mu^2_{DLL} - \mu^1_{DLL})$. We assume that the sample size is infinite so that we know the observed data distribution. Figure 2 shows the joint feasible region of $(\mu^2_{LLL} - \mu^1_{LLL}, \mu^2_{DLL} - \mu^1_{DLL})$ (the green shaded area). Suppose that the margin of clinical relevance $\Delta_0$ is 0.1; then the acceptance region for the null hypothesis $\mathcal{H}_{0,c}$ is the area to the lower left of the blue contour line. As there is no intersection between the feasible region of $(\mu^2_{LLL} - \mu^1_{LLL}, \mu^2_{DLL} - \mu^1_{DLL})$ and the acceptance region for $\mathcal{H}_{0,c}$, one may conclude that $\mathcal{H}_{0,c}$ should be rejected.

Alternatively, one can see from the contour lines of $\Delta_{\max}$ that the sharp lower bound for $\Delta_{\max}$ is 0.25. As $\Delta_0$ is smaller than this lower bound, $\mathcal{H}_{0,c}$ should be rejected.


Figure 2: Feasible region of $(\mu^2_{LLL} - \mu^1_{LLL}, \mu^2_{DLL} - \mu^1_{DLL})$ (green shaded area). The colored lines are contour lines of $\Delta_{\max}$. The sharp lower bound of $\Delta_{\max}$ is obtained at the red point.

However, by projecting the joint feasible region of $(\mu^2_{LLL} - \mu^1_{LLL}, \mu^2_{DLL} - \mu^1_{DLL})$ onto the individual axes, one concludes that the marginal feasible regions for $\mu^2_{LLL} - \mu^1_{LLL}$ and $\mu^2_{DLL} - \mu^1_{DLL}$ are both [0, 1]. As both of the marginal feasible regions contain values that are smaller than $\Delta_0$, the data contain no information on the specific contrast that corresponds to the overall treatment effect.
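The numbers in this example can be checked by brute force. The sketch below is only an illustrative grid search, not the linear programming algorithm of the paper. It uses the fact that, in this example, the arm-1 means are both zero, so the two contrasts equal $a = \mu^2_{LLL}$ and $b = \mu^2_{DLL}$, and that the three surviving strata have equal weight 1/3 in arm 2 (0.3/0.9 each), so that $(a + b + c)/3 = 0.5$ with $c = \mu^2_{DDL}$. Variable names and the grid resolution are assumptions of this sketch.

```python
from fractions import Fraction

# Grid of candidate values in [0, 1]; the resolution is an arbitrary choice.
STEP = Fraction(1, 100)
grid = [i * STEP for i in range(101)]

feasible = []
for a in grid:          # a = mu^2_LLL, which equals the LLL contrast here
    for b in grid:      # b = mu^2_DLL, which equals the DLL contrast here
        c = Fraction(3, 2) - a - b   # solve (a + b + c) / 3 = 1/2 for c
        if 0 <= c <= 1:              # c = mu^2_DDL must be a probability
            feasible.append((a, b))

# Joint analysis: smallest possible value of the larger contrast.
sharp_lower_bound = min(max(a, b) for a, b in feasible)

# Marginal analysis: project the joint feasible region onto each axis.
range_a = (min(a for a, _ in feasible), max(a for a, _ in feasible))
range_b = (min(b for _, b in feasible), max(b for _, b in feasible))

print(float(sharp_lower_bound))
print([float(x) for x in range_a], [float(x) for x in range_b])
```

The joint search recovers a lower bound of 0.25 on the larger contrast, whereas each projected range is the whole unit interval, which is the gap between joint and marginal information that the remark describes.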

6 Data Illustrations 

6.1 Application to the HIV Vaccine Trials Network 503 study 

The HIV Vaccine Trials Network (HVTN) 503 HIV vaccine study was a randomized, double-blinded, placebo-controlled Phase IIb test-of-concept clinical trial to investigate the efficacy and safety of an experimental HIV vaccine. The same vaccine was also evaluated in a different population in an earlier HVTN 502/Step trial. Starting in January 2007, the HVTN 503 study enrolled 800 HIV negative subjects and randomized them to receive three doses of either the study vaccine or a placebo. The ratio of vaccine to placebo assignment was 1:1. Enrollment and vaccinations were halted in September 2007 after the HVTN 502/Step trial met its prespecified non-efficacy criteria, but follow-up continued. Details of this study can be found in Gray et al. (2011, 2014).

In our analysis, we compared CD4 counts among participants within the same principal stratum defined 
by their full potential infection statuses. Due to the early stopping of vaccinations of the trial, a majority 
of participants in the HVTN 503 trial were not fully immunized. When enrollment was stopped, 400 participants
in the HVTN 503 trial were assigned to the experimental vaccine group. Of them, 112 received
one injection, 259 received two injections, and only 29 received all three injections. Hence we considered 
the dosage of experimental vaccine as the treatment arm Z, where Z = 0 for all subjects in the control 
group. As the trial was stopped administratively, and the time a participant entered this trial was unlikely to 
affect the potential outcomes of interest (CD4 count), it is reasonable to assume that the treatment arms were 
randomized. Furthermore, since there were only 3.6% of participants who received all three experimental 
vaccines, we code Z = 2 for all participants who receive two or more experimental vaccine injections. 

A total of 100 subjects were infected during this trial. We defined each subject's “median CD4 count” (the outcome of interest) as their median CD4 count measured between their confirmatory HIV testing visit and the end of follow-up or start of antiretroviral treatments. We also dichotomized CD4 count at 350 cells/mm³ and 200 cells/mm³ as they have been used in previous United States Department of Health and Human Services (DHHS) guidelines for initiating antiretroviral treatment. Note that the outcome is only measured for infected subjects. As 87.5% of the study subjects were uninfected, an intent-to-treat analysis with imputation for missing CD4 count values is likely to have very low power for detecting any treatment effects (Gilbert et al. 2003). Hence SACEs are of interest for analyzing this trial.

Table S2 in the Supplementary Materials summarizes the observed data for the study participants. There were 7 infected participants who had no CD4 count measurements after their confirmatory HIV testing visit. We made the missing completely at random (MCAR) assumption and left them out of our analysis below. In treatment arms 0, 1 and 2, the mean numbers of CD4 counts available were 5.69, 5.94 and 5.57, respectively; the mean lengths of time from the confirmatory HIV testing visit to the first CD4 count measurement were 26 days, 25 days and 32 days, respectively; and the mean time spacings between CD4 count measurements were 127 days, 146 days and 134 days, respectively.

Presumably there was little interaction among HVTN 503 subjects so that the SUTVA was plausible. Subsequent analyses of the HVTN 502 and HVTN 503 data suggested that although the investigational vaccine cannot itself directly cause HIV infection, it may increase susceptibility to HIV infection for recipients (Gray et al. 2011, 2014). Given the negative results on the primary efficacy endpoints, members of the HVTN 503 Protocol Team whom we consulted agreed that it is reasonable to make the reverse monotonicity assumption, namely that the experimental vaccine did not help prevent HIV infection for any participant in the study population. The empirical infection rates in the Z = 0, 1, 2 arms were 9.25%, 16.07% and 15.63%, respectively. Thus, the reverse monotonicity assumption seemed acceptable, and we proceeded with our analysis under this assumption.
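The empirical infection rates quoted above can be reproduced from the observed counts in the supplementary data table for the HVTN 503 trial. The short sketch below does this arithmetic; the variable names are ours, and infected participants with missing CD4 outcomes are counted as infected.

```python
# Counts by treatment arm from the supplementary data table: infected
# participants (Y = 1, Y = 0 and Y missing, all with S = 1) and uninfected ones.
infected = {0: 19 + 14 + 4, 1: 12 + 4 + 2, 2: 34 + 10 + 1}
uninfected = {0: 363, 1: 94, 2: 243}

rates = {}
for z in sorted(infected):
    total = infected[z] + uninfected[z]
    rates[z] = infected[z] / total
    print(f"Z = {z}: {infected[z]}/{total} = {100 * rates[z]:.3f}%")
```

The arm sizes implied by these counts are 400, 112 and 288, matching the allocation described in the text, and the rate in arm 2 is exactly 15.625%, which is reported as 15.63%.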

Table 1 summarizes the analysis results. The simultaneous testing method estimates the posterior probability of existence of an overall non-zero treatment effect, while the marginal testing method estimates the posterior probability that an overall non-zero treatment effect can be claimed along with the specific treatment arms and principal strata that correspond to this treatment effect. These posterior probabilities were high, suggesting evidence of a non-zero treatment effect on median CD4 falling below 350 or 200 cells/mm³. The 95% credible intervals for the lower bound on $\Delta_{\max}$ provide information on the magnitude of vaccine effects. For example, results in Table 1 show that there exists at least one basic principal stratum and treatment comparison for which the vaccine reduces the probability of median CD4 count ≤ 200 cells/mm³ by at least 0.026, but we were not able to ascertain the specific basic principal stratum and treatment comparison that corresponds to this effect. The reason for this is twofold. First, because of the non-identifiability of the SACEs, if the effect size is too small, one may fail to identify the specific causal contrast that corresponds to a clinically relevant treatment effect even with an infinite sample size. Second, our proposed methods may deliver more conclusive results if the sample size is large enough. For example, if the sample size had been 3000 (which was the estimated sample size in the HVTN 503 trial protocol) and the observed frequencies $P(S = 1 \mid Z = z)$ and $P(Y = 1 \mid Z = z, S = 1)$ had remained the same, then the 95% credible interval for the contrast $\mu^2_{LLL} - \mu^0_{LLL}$ would have been [0.057, 0.186], which would imply that compared to the placebo, receiving two or more injections of the experimental vaccine is clinically effective for reducing the possibility of very low CD4 cell counts (200 cells/mm³ or less) among subjects who would get infected regardless of which treatment arm they were assigned to.

Table 1: Posterior probabilities of finding a non-zero overall treatment effect and posterior credible intervals for lower bounds on $\Delta_{\max}$ (the maximal treatment effect over all principal strata and treatment comparisons) for the HVTN 503 trial

Methods          Posterior probability of a       95% credible interval for
                 non-zero treatment effect        lower bound on $\Delta_{\max}$

Outcome: median CD4 > 350
Simultaneous     0.882                            [0.000, 0.346]
Marginal         0.651                            [0.000, 0.341]

Outcome: median CD4 > 200
Simultaneous     0.996                            [0.026, 0.260]
Marginal         0.973                            [6 X 10-^ 0.245]

We conclude this part with several caveats. First, the median CD4 count is a non-traditional endpoint for HIV vaccine efficacy trials, and it may not be completely comparable between treatment groups because of differences in the number and timing of CD4 measurements. Second, we have dichotomized CD4 count in our analysis, which results in loss of information. Third, we have made the MCAR assumption for the missing values in CD4 count measures, which is hard to verify for this data set. Fourth, as pointed out by some authors (e.g., Pearl 2011), under the principal stratification framework we have taken here, the vaccine effect estimates are only relevant for the subgroup of subjects who would get infected under at least two dosage levels, which constitutes only a small fraction of the population. Finally, a reduction of 0.026 in the probability of median CD4 counts ≤ 200 cells/mm³ may not be considered clinically important given the earlier finding that the vaccine increased HIV acquisition in the study population.



6.2 Application to survey incentive trials 

With voluntary participation rates declining, there is now a consensus that incentives are effective for motivating response to surveys (Singer and Kulka 2002; Singer and Ye 2013). There is, however, controversy on how incentives affect the quality of data collected. Social exchange theory suggests that by establishing an explicit exchange relationship, incentives not only encourage participation in surveys, but also encourage respondents to provide more accurate and complete information (Davern et al. 2003). However, current experimental studies have mixed findings on this hypothesis (Singer and Kulka 2002; Singer and Ye 2013).

These experimental studies directly compare response quality in different incentive groups without accounting for the problem of truncation by response. Here the treatments Z are the levels of incentive, the intermediate outcomes S are the responses to the surveys, and the final outcomes Y are measures of survey quality. Although some researchers realize that people persuaded to participate through the use of incentives will have less internal motivation for filling out the survey thoroughly (e.g., Davern et al. 2003), few, if any, separate this group of people in their analyses from those who would participate in the survey regardless of incentive levels, rendering their results subject to selection bias. Furthermore, arguably the response quality is undefined for survey non-respondents. Thus, as argued by Rubin (2006) and others, the naive comparison is not causal as it compares different groups of people at baseline. Instead, for two-arm trials, the SACE is of interest as the subgroup whose members would respond regardless of the level of incentive is the only group for which both of the potential outcomes are well-defined. This holds similarly for multiarm trials. Moreover, it is very common that such randomized experiments have multiple incentive groups (Singer and Kulka 2002; Singer and Ye 2013). Hence the methodology introduced in this paper, and more generally, identification and estimation methods for SACEs in multiarm trials, are especially relevant.

For example, Curtin et al. (2007) used data from the Survey of Consumer Attitudes (SCA) conducted by the University of Michigan Survey Research Center to investigate whether efforts to increase the response rate jeopardize response quality. Their analysis was based on a random digit dial telephone survey conducted between November 2003 and February 2004. In each of the four months, eligible samples were randomly assigned to one of three experimental conditions: advance letter without an incentive, advance letter plus $5 incentive and advance letter plus $10 incentive. The same follow-up procedures, including promised refusal conversion payments, were used in all three groups. The measures of response quality in such studies are inevitably subjective; they can be binary (e.g., “mostly complete” vs “partially complete,” or whether a particularly important question is answered) or continuous (e.g., percent of missing items). As we do not have access to this data set, below we only discuss the validity of our assumptions.

The SUTVA is reasonable as these are random digit dial samples from the coterminous United States. The monotonicity assumption is also plausible. As argued by survey sampling experts, incentives will motivate response as they compensate for the relative absence of factors that might otherwise stimulate cooperation (Singer and Kulka 2002), so that individuals who would respond with a lower incentive would also respond if offered a higher incentive. Empirical evidence in this study also supports this assumption: the response rates for the three experimental groups were 51.7%, 63.8% and 67.7% (Curtin et al. 2007).


7 Discussion 

In randomized trials with truncation by death, the average causal effects in basic principal strata are often of interest as they provide causally meaningful and interpretable summaries of the treatment effects. However, for trials with multiple treatment arms, there are usually many such causal contrasts that are of interest to investigators. In this article, we consider testing and estimation problems on the basic principal stratum causal effects. Specifically, we propose three scientific questions to understand the overall treatment effect and individual principal stratum causal effects. We then develop novel inference procedures to answer these questions, and show that the proposed procedures have desirable asymptotic properties.

Compared to analyzing a multiarm trial in a standard setting, the main difficulty introduced by truncation by death is that the causal estimands are not identifiable. In this case, we show that compared to marginal methods, the (ANOVA type) simultaneous inference methods provide more power for testing the overall treatment effect, and the advantage remains even with an infinite sample size. These results demonstrate the importance of addressing both joint and marginal hypotheses in a causal analysis of multiarm trials with truncation by death. This idea may be applied to analyse multiarm trials in other settings in which the causal estimands are not identifiable. For example, in multiarm trials with non-compliance, existing methods consider the causal contrasts separately (Cheng and Small 2006; Long et al. 2010). Although results obtained with such methods are valid, they are often not informative, especially in the case where there are more than three treatment arms (Long et al. 2010). In this case, a simultaneous inference method may yield a greater posterior probability of claiming an overall treatment effect and the joint posterior credible intervals are less likely to contain the origin.

In analyzing a multiarm trial with truncation by death, researchers may dichotomize the treatment variable to simplify an analysis, especially in settings where the multiarm trials consist of a placebo arm and several dosage groups for an active treatment. One such example is the HVTN 503 study, where the treatment groups 1 and 2 can be considered as different versions of the experimental vaccine. However, as noted by Hernán and VanderWeele (2011), results from analyses that combine treatment arms in this way may not be generalizable to other populations as the causal effect of a compound treatment depends on the distribution of treatment versions in the target population. Moreover, because of the non-identifiability of SACEs, one may fail to find an overall treatment effect that could have been found by applying the proposed simultaneous inference procedure. For example, for the HVTN 503 study, if one were to collapse the active treatment groups into a single compound treatment, then the 95% credible interval for the SACE corresponding to this compound treatment would be [0.000, 0.253], with which one could not claim any clinically relevant treatment effect.

To account for sampling uncertainty in the observed data distribution, we use Bayesian analysis methods to obtain posterior samples of identifiable quantities p. An alternative Bayesian procedure to our method involves posterior sampling on the mean potential outcomes $\mu_{m+1}$. This alternative approach would directly yield the posterior rejection rate of $\mathcal{H}_0$ and credible intervals for $\Delta_{\max,\mathrm{slb}}$ without resorting to techniques we have introduced. However, as $\mu_{m+1}$ is not identifiable from the observed data, it turns out that the posterior estimates of $\Delta_{\max}$ are extremely sensitive to the prior specification on $\mu_{m+1}$. We refer interested readers to Richardson et al. (2011) for a further discussion of this issue.

The problem we consider here is similar to an instrumental variable analysis in that both problems can 
be analysed under the principal stratification framework. When the exposure variable in an instrumental 
variable analysis is binary, the exclusion restriction assumption is closely related to the null hypothesis in 
the truncation by death problem, namely the causal effect in the always-survivor group is zero. Hence the
approach we develop here may be used to partially test the exclusion restriction assumption of an instrumental
variable model.

There are several possible extensions to our framework. For example, we have restricted our attention to 
binary outcomes in this article. We are currently exploring extensions to deal with continuous and categorical
outcomes. In addition, covariate information may be employed to sharpen bounds on SACEs. Another
possible extension is to introduce sensitivity parameters for better understanding of the causal effects of 
interest. The tests and bounds we have developed here correspond to extreme results of corresponding 
sensitivity analyses. 
Supplementary Materials for “Causal Analysis of Ordinal
Treatments and Binary Outcomes under Truncation by Death” 

Linbo Wang, Thomas S. Richardson and Xiao-Hua Zhou 

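1 Algorithm for identifying $\Delta_{\max,\mathrm{slb}}$


See Algorithm S1.


Algorithm S1: An algorithm for identifying $\Delta_{\max,\mathrm{slb}}$

1. Solve the following linear programming problem:

   minimize $\alpha$ subject to:

   $$\sum_{g:\, M(g) \le z} \rho_g^z \mu_g^z = m(z), \quad z = 0, \ldots, m - 1;$$

   $$\max\left(0,\ m(z) - \rho^z_{D^m L}\right) \le \sum_{g:\, M(g) \le z - 1} \rho_g^z \mu_g^z \le \min\left(1 - \rho^z_{D^m L},\ m(z)\right), \quad z = m;$$

   $$\mu_g^{z_1} - \mu_g^{z_2} \le \alpha, \quad \forall g,\ z_1 > z_2;$$

   $$0 \le \mu_g^z \le 1, \quad \forall g, z.$$

2. Report the value of the linear programming problem above as $\Delta_{\max,\mathrm{slb}}$.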



2 Simulation studies 

We now use a hypothetical example to illustrate the advantage of the simultaneous inference procedures proposed in Sections 3 and 4 in the main text for testing the overall treatment effect. Let the comparison method be the approach that considers each $\Delta(z_1, z_2; g)$ separately and accepts or rejects the null based on the marginal feasible regions of $\Delta(z_1, z_2; g)$. With the comparison marginal testing method, one rejects the hypothesis $\mathcal{H}_0$ only if at least one of the marginal feasible regions excludes 0. In other words, the comparison method rejects $\mathcal{H}_0$ if the observed data not only provide evidence for existence of a non-zero treatment effect, but also contain information on the specific principal strata and treatment arms that correspond to this treatment effect. As explained in Remark 3 in the main text, this generally yields a smaller posterior rejection probability. In addition, with the comparison method, one estimates the lower bound on $\Delta_{\max}$ to be the maximal sharp lower bound for all $\Delta(z_1, z_2; g)$ that appear in equation (1) in the main text. We denote this lower bound as $\Delta_{\max,\mathrm{mlb}}$, where “mlb” is short for “marginal lower bound.” One can see from the numerical example in Remark 3 in the main text that $\Delta_{\max,\mathrm{mlb}}$ is in general no larger than $\Delta_{\max}$. This is because the comparison marginal estimation method does not use information on the dependence among feasible regions of causal contrasts $\Delta(z_1, z_2; g)$. In the simulation studies, we empirically evaluate the difference between the proposed simultaneous inference methods and the comparison marginal inference methods for testing the overall treatment effect.

Suppose that we have a three-arm vaccine trial with two vaccine groups and one placebo group, and there are 400 subjects in each group. The hypothetical data example is listed in Table S1, where $n_1$ is a parameter taking integer values between 0 and 40. The conditional frequencies $m(0)$, $m(1)$ and $m(2)$ in this example are $0.025 n_1$, 0.7 and 0.9, respectively. In our example, 10% of the study sample are in each of the principal strata LLL, DLL and DDL, while the rest belong to the DDD stratum.

Table S1: Observed data counts in a hypothetical example.

Observed subgroup        Counts
Y = 1, S = 1, Z = 0      n_1
Y = 0, S = 1, Z = 0      40 - n_1
S = 0, Z = 0             360
Y = 1, S = 1, Z = 1      56
Y = 0, S = 1, Z = 1      24
S = 0, Z = 1             320
Y = 1, S = 1, Z = 2      108
Y = 0, S = 1, Z = 2      12
S = 0, Z = 2             280
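The conditional frequencies and stratum proportions quoted for this hypothetical example follow directly from the counts in Table S1. The sketch below recomputes them for one value of $n_1$; the function name and the dictionary layout are ours, and the stratum proportions are obtained as successive differences of the arm-specific survival rates, which is valid under the monotonicity assumption.

```python
from fractions import Fraction


def summarize(n1):
    """Return m(z) and principal stratum proportions for a given n1."""
    # counts[z] = (Y = 1 & S = 1, Y = 0 & S = 1, S = 0), as in Table S1
    counts = {0: (n1, 40 - n1, 360), 1: (56, 24, 320), 2: (108, 12, 280)}
    m, surv = {}, {}
    for z, (y1, y0, dead) in counts.items():
        alive = y1 + y0
        m[z] = Fraction(y1, alive)               # P(Y = 1 | Z = z, S = 1)
        surv[z] = Fraction(alive, alive + dead)  # P(S = 1 | Z = z)
    # Under monotonicity, differences of survival rates give the strata.
    strata = {
        "LLL": surv[0],
        "DLL": surv[1] - surv[0],
        "DDL": surv[2] - surv[1],
        "DDD": 1 - surv[2],
    }
    return m, strata


m, strata = summarize(n1=20)
print({z: float(v) for z, v in m.items()})
print({g: float(v) for g, v in strata.items()})
```

For any admissible $n_1$, the first conditional frequency equals $n_1/40 = 0.025 n_1$, while the stratum proportions do not depend on $n_1$.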

Results in Figure S1 show that for some values of $n_1$, the simultaneous and marginal methods compared here yielded similar results. However, in some other cases, the results could be very different. For example, when $n_1 = 36$, the simultaneous testing method estimated the posterior probability of rejecting $\mathcal{H}_0$ to be 98.8%, compared to an estimate of 4.0% from the marginal testing method. When $n_1 = 20$, the simultaneous estimation method estimated the 95% credible interval for $\Delta_{\max,\mathrm{slb}}$ to be [0.029, 0.404], based on which one was able to claim a clinically relevant treatment effect at margin $\Delta_0 = 0.02$. The marginal estimation method, however, estimated the 95% credible interval for $\Delta_{\max,\mathrm{mlb}}$ to be [4 x 10“^, 0.363], with which one failed to claim a clinically relevant treatment effect at the same margin.

Figure S1: Results from analyzing the hypothetical data set in Table S1. The left panel shows the posterior probability of rejecting $\mathcal{H}_0$ using the proposed simultaneous testing method and the comparison marginal testing method. The right panel shows the posterior mean (solid lines) and 95% credible intervals (dashed lines) for lower bounds on $\Delta_{\max}$, the maximal treatment effect among all possible basic principal strata and treatment comparisons. The red curves correspond to sharp lower bounds obtained using the proposed simultaneous estimation method, and the black curves correspond to lower bounds obtained using the comparison marginal estimation method. The blue horizontal line corresponds to a clinically meaningful margin of 0.02.


3 Data Table for the HVTN 503 study 


Table S2 gives the observed data counts for the HVTN 503 trial.


4 Proofs of theorems and lemmas 

A Proof of Theorem 1 


The proof for the general multi-arm case is very similar to the discussion for the three-arm case. The only non-trivial generalization is for Step 3 of Algorithm 1 in the main text. Instead of checking the pairwise intersections of $\{B_{kz};\ z = k, \ldots, m\}$, we check their joint intersection. This relies on the observation that if we let $g = D^k L^{m+1-k}$, then $\omega_g = 1$ and the feasible region $B_{kk}$ is a one-point set $\{\int y \, dQ_g\}$. Consequently,

$$\bigcap_{z = k, \ldots, m} B_{kz} \ne \emptyset \qquad \text{(S1)}$$

implies that

$$B_{k z_1} \cap B_{k z_2} \ne \emptyset, \quad \forall z_1 > z_2 \ge k. \qquad \text{(S2)}$$

Table S2: Observed data counts in the HVTN 503 trial. Z denotes the treatment arm, S denotes the infection status, and Y is the dichotomized outcome of CD4 count. Y = * denotes that Y is missing.

Observed subgroup        median CD4 > 350 cells/mm³    median CD4 > 200 cells/mm³
Y = 1, S = 1, Z = 0      19                            29
Y = 0, S = 1, Z = 0      14                            4
Y = *, S = 1, Z = 0      4                             4
S = 0, Z = 0             363                           363
Y = 1, S = 1, Z = 1      12                            16
Y = 0, S = 1, Z = 1      4                             0
Y = *, S = 1, Z = 1      2                             2
S = 0, Z = 1             94                            94
Y = 1, S = 1, Z = 2      34                            44
Y = 0, S = 1, Z = 2      10                            0
Y = *, S = 1, Z = 2      1                             1
S = 0, Z = 2             243                           243

Note there are only $m - k$ pairs of comparisons involved in (S1), compared to $(m + 1 - k)(m - k)/2$ pairs
of comparisons in (S2).


B Proof of Theorem 2 

To prove Theorem 2, we note that the assumptions of Theorem 2 and the observed data distribution impose 
the following constraints on 

E PIQI + PIQI+ E pIQh (S3) 

//f= = (S4) 


where $Q^z$ denotes the distribution of outcome $Y$ in treatment arm $z$. To simplify (S3) and (S4), we use


the following lemmas, which say that both the proportions of basic principal strata Pg and the means of 
Bernoulli distributions {Qg,g G 2 ; > M(g)) are identifiable. Proofs of these lemmas are left to the 
end of this subsection. 


Lemma 2: The proportions of basic principal strata, namely $\{\rho_g^z;\ g \in \Omega_{m+1},\ z \ge M(g)\}$, are identifiable
from the observed data.
Lemma 3: Suppose that (6) in the main text holds for all k < l, then G > M.{g)) are

identifiable from the observed data. 


As the Bernoulli distribution is uniquely determined by its mean the constraints on can be 


simplified as 


Qi = ^lQl+ E ‘4Qh 


(S5) 


where $Q_g^z$ is a Bernoulli distribution with mean $m_g(z)$. Applying Imai (2008)'s results to (S5), we have


Biz — 


ydL^ziQ^g), / ydU^zlQ^g) 


This completes the proof of Theorem 2. □ 

Proof of Lemma 2

Proof. Let $\pi_g^z = P(G = g \mid Z = z)$. Following Assumption 2, $\pi_g^z$ is independent of treatment arm $z$ and hence can be written as $\pi_g$. Under Assumption 3, we have the following equations:

$$
\begin{aligned}
P(S = 1 \mid Z = 0) &= \pi_{L^{m+1}}, \\
P(S = 1 \mid Z = 1) &= \pi_{L^{m+1}} + \pi_{D L^{m}}, \\
&\ \ \vdots \\
P(S = 1 \mid Z = z) &= \pi_{L^{m+1}} + \cdots + \pi_{D^{z} L^{m+1-z}}, \\
&\ \ \vdots \\
P(S = 1 \mid Z = m) &= \pi_{L^{m+1}} + \cdots + \pi_{D^{m} L}, \\
1 &= \pi_{L^{m+1}} + \cdots + \pi_{D^{m+1}}.
\end{aligned}
\qquad \text{(S6)}
$$

It can be shown that there exists a unique solution to equation (S6) and hence $\{\pi_g;\ g \in \Omega_{m+1}\}$ are identifiable from equation (S6). It then follows that $\{\rho_g^z;\ g \in \Omega_{m+1},\ z \ge M(g)\}$ are also identifiable.

□


Proof of Lemma 3

Proof. As (6) in the main text holds for all $k < l$, we only need to show that $\mu_g^{M(g)}$ is identifiable from the observed data. We show this by induction on $M(g)$.

Base case: if $M(g) = 0$, then $\mu_g^{M(g)} = P(Y = 1 \mid Z = 0, S = 1)$ by the monotonicity assumption (Assumption 3).

Inductive step: suppose that $\mu_g^{M(g)}$ is identifiable from the observed data for all principal strata $g$ such that $M(g) \le k$. Following the monotonicity assumption (Assumption 3), we have the following identity:


P{Y = l\Z = k + l,S = l)=Y, 


where the last step in (S7) follows from the working hypotheses.

Following Lemma 1^ iPg]g £ ^k) and are identifiable from fhe observed dafa. Following 


fhe induction hypofheses, G Q^) are also identifiable. Consequenfly, is identifiable 


fc+i 


M(g) 


from (S7). In other words, for principal strata $g$ such that $M(g) = k + 1$, $\mu_g^{M(g)}$ is also identifiable from
the observed data.

By the induction principle, we have finished our proof. □


C Proof of Theorem 4 


Theorem 4 is a direct consequence of the following lemma:

Lemma 4: Let $h$ be a mixture of $k$ Bernoulli distributions $f_1, \ldots, f_k$: $h = \sum_{j=1}^{k} \alpha_j f_j$, where the mixing proportions $\alpha_j$, $j = 1, \ldots, k$, are known. Let $P, P_1, \ldots, P_k$ be the probability of a positive outcome under $h, f_1, \ldots, f_k$, respectively. Then, for $l \le k$,

$$\max\left(0,\ P - \sum_{j=l+1}^{k} \alpha_j\right) \le \sum_{j=1}^{l} \alpha_j P_j \le \min\left(\sum_{j=1}^{l} \alpha_j,\ P\right).$$

Lemma 4 is a generalization of Lemma 1 in Cheng and Small (2006) and can be proved by solving the linear programming problem of minimizing or maximizing $\sum_{j=1}^{l} \alpha_j P_j$ subject to the constraint $P = \sum_{j=1}^{k} \alpha_j P_j$.

□
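The bounds in Lemma 4 are easy to evaluate numerically. The sketch below is an illustration under our reading of the lemma (bounds on the weighted sum of the first $l$ component probabilities); the function name and the example inputs are hypothetical. It also compares the closed-form bounds with a brute-force search over a coarse grid of component probabilities.

```python
from fractions import Fraction
from itertools import product


def partial_sum_bounds(alphas, p, l):
    """Bounds on sum_{j<=l} alpha_j * P_j given sum_{j<=k} alpha_j * P_j = p."""
    head = sum(alphas[:l])   # total weight of the first l components
    tail = sum(alphas[l:])   # total weight of the remaining components
    return max(0, p - tail), min(head, p)


# Hypothetical example: three equally weighted components, overall mean 1/2.
alphas = [Fraction(1, 3)] * 3
p = Fraction(1, 2)
lower, upper = partial_sum_bounds(alphas, p, l=2)

# Brute-force check: enumerate (P_1, P_2, P_3) on a grid with step 1/6 and
# keep the combinations that reproduce the overall mean exactly.
grid = [Fraction(i, 6) for i in range(7)]
values = [
    sum(a * q for a, q in zip(alphas[:2], ps[:2]))
    for ps in product(grid, repeat=3)
    if sum(a * q for a, q in zip(alphas, ps)) == p
]
print(lower, upper, min(values), max(values))
```

In this example the grid search attains both closed-form endpoints, which is consistent with the bounds being sharp.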